# Multimodal Visual Understanding

**Qwen2.5 VL 3B Instruct GGUF** (unsloth) · Image-to-Text · English · 4,645 downloads · 4 likes
Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring powerful visual understanding and multimodal processing capabilities.

**PE Lang G14 448** (facebook) · Apache-2.0 · Image Feature Extraction · 247 downloads · 11 likes
The Perception Encoder is a state-of-the-art image and video understanding encoder trained through vision-language training, with strong generalization capabilities.

**PE Lang L14 448** (facebook) · Apache-2.0 · Image Feature Extraction · 1,087 downloads · 6 likes
The Perception Encoder (PE) is an advanced image and video understanding encoder trained through vision-language learning, achieving state-of-the-art performance on various visual tasks.

**Space Model** (Alhdrawi) · Apache-2.0 · Image-to-Text · Transformers · Multilingual · 58 downloads · 1 like
Qwen2.5-VL-32B-Instruct is the latest vision-language model in the Qwen family, featuring powerful visual understanding and intelligent agent capabilities and supporting multimodal task processing.

**Qwen2.5 VL 7B Instruct GGUF** (Mungert) · Apache-2.0 · Image-to-Text · English · 17.10k downloads · 10 likes
Qwen2.5-VL-7B-Instruct is a multimodal vision-language model that supports image understanding and text generation tasks.

**Qwen2.5 VL Instruct 3B Geo** (kxxinDave) · Apache-2.0 · Image-to-Text · Transformers · English · 29 downloads · 2 likes
Qwen2.5-VL is the latest vision-language model in the Qwen family, focusing on enhanced visual understanding and agent capabilities.

**Mlabonne Gemma 3 4b It Abliterated GGUF** (bartowski) · Image-to-Text · 9,164 downloads · 8 likes
A quantized version of the mlabonne/gemma-3-4b-it-abliterated model, produced with llama.cpp imatrix quantization and suitable for image-text-to-text tasks.

**Toriigate V0.4 7B I1 GGUF** (mradermacher) · Apache-2.0 · Image-to-Text · English · 410 downloads · 1 like
A weighted/importance-matrix (imatrix) quantized version of the Minthy/ToriiGate-v0.4-7B model, offering multiple quantization options to suit different needs.

**Qwen2.5 VL 72B Instruct AWQ Fix** (Benasd) · Other · Image-to-Text · Transformers · English · 94 downloads · 1 like
Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring powerful visual understanding and agent capabilities and supporting multi-format visual localization and structured output generation.

**Qwen2.5 VL 72B Instruct AWQ** (Benasd) · Other · Image-to-Text · Transformers · English · 173 downloads · 6 likes
Qwen2.5-VL is a multimodal large language model from the QwenLM team, featuring powerful visual understanding and intelligent agent capabilities and supporting image, video, and text inputs.

**Qwen2.5 VL 7B Instruct AWQ** (Benasd) · Apache-2.0 · Image-to-Text · Transformers · English · 226 downloads · 7 likes
Qwen2.5-VL is a multimodal vision-language model from Tongyi Qianwen (Qwen), featuring powerful image understanding and text generation capabilities.

**Minicpm O 2 6 Gguf** (openbmb) · Image-to-Text · 5,660 downloads · 101 likes
MiniCPM-o 2.6 is a multimodal model supporting vision and language tasks, packaged in GGUF format for use with llama.cpp.
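
GGUF checkpoints such as this one are run with llama.cpp rather than Transformers. Here is a minimal llama-cpp-python sketch; it assumes the `MiniCPMv26ChatHandler` that llama-cpp-python documents for the MiniCPM-V 2.6 family also applies to this release, and the GGUF file names and image path are placeholders.

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

def image_to_data_uri(path: str) -> str:
    """Encode a local image as a base64 data URI, the format llama-cpp-python accepts."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

# Placeholder file names: download the main GGUF and the vision projector (mmproj) from the repo.
chat_handler = MiniCPMv26ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="Model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # image tokens are long; leave room in the context window
)

response = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": image_to_data_uri("photo.png")}},
        {"type": "text", "text": "Describe this image."},
    ],
}])
print(response["choices"][0]["message"]["content"])
```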

**Razorback 12B V0.2** (nintwentydo) · Other · Image-to-Text · Transformers · Multilingual · 17 downloads · 3 likes
Razorback 12B v0.2 is a multimodal model that combines the strengths of Pixtral 12B and UnslopNemo v3, featuring visual understanding and language processing capabilities.

**Llama 3.2 90B Vision Instruct Unsloth Bnb 4bit** (unsloth) · Image-to-Text · Transformers · English · 58 downloads · 2 likes
A 90B-parameter multimodal large language model from the Meta Llama 3.2 series that supports visual instruction understanding, optimized with Unsloth dynamic 4-bit quantization.

**Minicpm V 2 6 Rk3588 1.1.4** (c01zaut) · Image-to-Text · Transformers · Other · 31 downloads · 3 likes
MiniCPM-V 2.6 is a GPT-4V-level multimodal large language model that supports single-image, multi-image, and video understanding, optimized here for the RK3588 NPU.

**Cambrian 8b** (nyu-visionx) · Apache-2.0 · Image-to-Text · Transformers · 565 downloads · 63 likes
Cambrian is an open-source multimodal LLM designed with a vision-centric approach.

**Phi 3 Vision 128k Instruct** (microsoft) · MIT · Image-to-Text · Transformers · Other · 25.19k downloads · 958 likes
Phi-3-Vision-128K-Instruct is a lightweight, cutting-edge open multimodal model supporting a 128K-token context length, focused on high-quality reasoning over text and visual inputs.
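
Phi-3-Vision ships its modeling code inside the repository, so loading it requires `trust_remote_code`. A minimal sketch following the pattern on the model card; the image path is a placeholder.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
# trust_remote_code is required: the modeling code lives in the repo, not in transformers.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Phi-3-Vision numbers its image placeholders: <|image_1|>, <|image_2|>, ...
messages = [{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open("chart.png")  # placeholder path
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs, max_new_tokens=256, eos_token_id=processor.tokenizer.eos_token_id
)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```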

**Llava Phi 3 Mini 4k Instruct** (MBZUAI) · MIT · Image-to-Text · Transformers · 550 downloads · 22 likes
A vision-language model that combines the Phi-3-mini-3.8B large language model with LLaVA v1.5, providing advanced vision-language understanding capabilities.

**Owlv2 Base Patch16** (Xenova) · Object Detection · Transformers · 17 downloads · 0 likes
OWLv2 is a vision-language pre-trained model for zero-shot, text-conditioned object detection and localization.
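
The Xenova repo hosts an ONNX conversion of OWLv2 for Transformers.js; the upstream PyTorch checkpoint is google/owlv2-base-patch16. A minimal zero-shot detection sketch against that checkpoint, with placeholder image path and text queries:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

# Upstream PyTorch weights; the Xenova repo is the ONNX port of the same model.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")

image = Image.open("street.jpg")  # placeholder path
texts = [["a photo of a car", "a photo of a traffic light"]]  # free-text queries
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes back to the original image size: (height, width).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)
for score, label, box in zip(
    results[0]["scores"], results[0]["labels"], results[0]["boxes"]
):
    print(f"{texts[0][int(label)]}: {score:.2f} at {box.tolist()}")
```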